license: “CC BY-NC”
Creative Commons: Attribution-NonCommercial
https://creativecommons.org/licenses/by-nc/4.0/
The following procedural example is based on documentation found at the rvest documentation site.
read_html
library(rvest)
results <- read_html("http://www.vondel.humanities.uva.nl/ecartico/persons/index.php?subtask=browse")
The results object is a list (an R data type). The items in the list correspond to the basic structure of an HTML document…
Displaying the results object shows that the first item in the list is head and the second item is body. These items correspond to the two top-level elements of an HTML document. In other words, the text, links, and other markup were scraped from the web page. That content is found in the body element of the HTML document and is now stored in the body element of the results list.
Contents of the results object
results
## {html_document}
## <html lang="en" prefix="foaf: http://xmlns.com/foaf/0.1/ owl: http://www.w3.org/2002/07/owl# schema: http://schema.org/ time: http://www.w3.org/2006/time# skos: http://www.w3.org/2004/02/skos/core# rdf: http://www.w3.org/1999/02/22-rdf-syntax-ns# bio: http://purl.org/vocab/bio/0.1/ wdps: http://www.wikidata.org/prop/statement/ ecartico: http://www.vondel.humanities.uva.nl/ecartico/lod/vocab/# rdfs: http://www.w3.org/2000/01/rdf-schema# pnv: https://w3id.org/pnv# sem: http://semanticweb.cs.vu.nl/2009/11/sem/ ogcgs: http://www.opengis.net/ont/geosparql# ogcsf: http://www.opengis.net/ont/sf#" id="description">
## [1] <head about="#description" typeof="schema:CreativeWork schema:DataFeedIte ...
## [2] <body>\n\n\n\n<div id="projecttitle">\n\n<div id="projecttop">\n<a href=" ...
Example HTML
A simplified example HTML document
<HTML>
<HEAD>
<title>my example document</title>
</HEAD>
<BODY>
<h1>Hello World</h1>
<p>HTML is a tagging system known as the HyperText Markup Language</p>
</BODY>
</HTML>
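To see how the head and body items arise, the simplified document can be parsed without touching the network. This is a minimal sketch: the object names example_html, example_doc, top_level, and heading are my own, and it relies on read_html() accepting a literal HTML string.

```r
library(rvest)  # also provides the %>% pipe
library(xml2)   # xml_children() and xml_name()

# The simplified example document, held in a string so no network is needed
example_html <- "<html>
<head><title>my example document</title></head>
<body>
<h1>Hello World</h1>
<p>HTML is a tagging system known as the HyperText Markup Language</p>
</body>
</html>"

example_doc <- read_html(example_html)

# Names of the two top-level items of the parsed document
top_level <- xml_name(xml_children(example_doc))

# Text of the <h1> element inside <body>
heading <- example_doc %>%
  html_nodes("h1") %>%
  html_text()

print(top_level)
print(heading)
```

The same two calls, html_nodes() and html_text(), are the ones used on the real page later in this walkthrough.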
The basic workflow of web scraping is
Development
Production
Iterate
A web page is composed of HTML and prose. The web document, just as the web site, has a hierarchical structure. Web scraping means parsing these structures to gather needed information.
The first step is to start with a single target document (i.e. a web page or a leaf of the site). In this case, the document we want to parse is the summary navigation page consisting of the first 50 names listed alphabetically on this web site. The goal is to parse the HTML source of that web page (i.e. document) by traversing the nodes of the document’s HTML structure. In other words, we want to mine text and data from the body section of the results list. In this example I’ll gather all the HTML within the <li> tags.
li stands for “list item”. You can learn more about the li tag structure from HTML documentation.
Select the nodes in the body of the HTML document tree. This is done with the html_nodes() function, e.g. html_nodes("li").
Use the html_text() function to parse the text of the HTML list item (i.e. the <li> tag).
In an HTML document that has tagging such as this:
<li><a href="/ecartico/persons/17296">Anna Aaltse (1715 - 1738)</a></li>
I want to gather the text within the <li> tag: e.g. Anna Aaltse (1715 - 1738)
You can use the Selector Gadget to help you identify the HTML/CSS tags and codes.
Using the html_nodes() and html_text() functions, I can retrieve all the text within <li></li> tags.
names <- results %>%
html_nodes("#setwidth li a") %>%
html_text()
names
## [1] "Hillebrand Boudewynsz. van der Aa (1661 - 1717)"
## [2] "Boudewijn Pietersz van der Aa (? - ?)"
## [3] "Pieter Boudewijnsz. van der Aa (1659 - 1733)"
## [4] "Boudewyn van der Aa (1672 - ?)"
## [5] "Machtelt van der Aa (? - ?)"
## [6] "Claas van der Aa I (? - ?)"
## [7] "Claas van der Aa II (? - ?)"
## [8] "Willem van der Aa (? - ?)"
## [9] "Hans von Aachen (1552 - 1615)"
## [10] "Jacobus van Aaken (? - ?)"
## [11] "Justus van Aaken (? - ?)"
## [12] "Johannes Aalmis (1714 - 1799)"
## [13] "Johan Bartholomeus Aalmis (1723 - 1786)"
## [14] "Maria van Aalst (1639 - 1664)"
## [15] "Anna Aalst (? - ?)"
## [16] "Anna Aaltse (1715 - 1738)"
## [17] "Allart Aaltsz (1665 - 1748)"
## [18] "Geertruy Aaltsz (? - 1732)"
## [19] "Maria Aaltsz (? - 1746)"
## [20] "Catharina Aaltsz (? - 1727)"
## [21] "Nikolaas van Aaltwijk (1692 - 1727)"
## [22] "Maria Aams (1711 - 1774)"
## [23] "Jacobus Aams (1680 - ?)"
## [24] "Jan Govertsz. van der Aar (1544 - 1612)"
## [25] "Anna van der Aar (1576 - 1656)"
## [26] "Janneke Jans van Aarden (1609 - 1651)"
## [27] "Abraham van Aardenberg (1672 - 1717)"
## [28] "Willem Aardenhout I (? - ?)"
## [29] "Margrietje Aarlincx (1637 - 1690)"
## [30] "Dirck van Aart (1680 - 1737)"
## [31] "Jonas Abarbanel (? - 1667)"
## [32] "Josephus Abarbanel (? - ?)"
## [33] "Esther Abarbanel (? - ?)"
## [34] "Rachel Abarbanel (? - ?)"
## [35] "Lea Abarbanel (1691 - ?)"
## [36] "Isaac Abarbanel (1637 - 1723)"
## [37] "Damiana Abarca (? - 1630)"
## [38] "Bartholomeus Abba (1641 - 1684)"
## [39] "Cornelis Dirksz. Abba (1604 - 1675)"
## [40] "Clara Abba (1631 - 1671)"
## [41] "Aerlant Abbas (1606 - 1696)"
## [42] "Matheus Jansz Abbas (1569 - ?)"
## [43] "Hendrik Abbé (1639 - 1677)"
## [44] "Claude Abbé (? - 1653)"
## [45] "Simon Jan Pontenz. Abbe (1467 - 1549)"
## [46] "Simon IJsbrandz. Abbe (? - ?)"
## [47] "Ysbrandt Simonsz. Abbe (? - 1559)"
## [48] "Maximiliaen l' Abbé (? - 1675)"
## [49] "Marten Simonsz. Abbe genaamd Schuyt (? - 1592)"
## [50] "Daniël Abbeloos (ca. 1635 - 1677)"
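The selector used above, "#setwidth li a", is narrower than a bare "li". The sketch below shows the difference on a small invented page; only the id setwidth and the two person links come from this walkthrough, while the navigation list and the object names are assumptions for illustration.

```r
library(rvest)

# Hypothetical page: a navigation list plus a results list inside #setwidth
page <- read_html('
<body>
  <ul id="nav"><li><a href="/home">Home</a></li></ul>
  <div id="setwidth">
    <ul>
      <li><a href="../persons/17296">Anna Aaltse (1715 - 1738)</a></li>
      <li><a href="../persons/38518">Allart Aaltsz (1665 - 1748)</a></li>
    </ul>
  </div>
</body>')

# Every list item on the page, navigation included
all_items <- page %>%
  html_nodes("li") %>%
  html_text()

# Only the links inside list items under the element with id "setwidth"
scoped_items <- page %>%
  html_nodes("#setwidth li a") %>%
  html_text()

print(all_items)
print(scoped_items)
```

Scoping the selector at parse time is one way to keep navigation links out of the results, so less cleanup is needed afterwards.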
Beyond the text you may also want attributes of HTML tags. To mine the URL of a hypertext link <a href="URL"></a>, within a list item, you need to parse the href attribute of an anchor tag. If you’re new to web scraping, you’re going to need to learn something about HTML tags, such as the anchor tag.
In an HTML document that has tagging such as this:
<a href="https://search.com">Example Link</a>
I want to gather the value of the href attribute within the anchor tag: https://search.com
Using the html_nodes() and html_attr() functions, I can retrieve all the attribute values within <li><a></a></li> tags.
url <- results %>%
html_nodes("#setwidth li a") %>%
html_attr("href")
url
## [1] "../persons/414" "../persons/10566" "../persons/10567" "../persons/10568"
## [5] "../persons/27132" "../persons/33780" "../persons/33781" "../persons/33782"
## [9] "../persons/9203" "../persons/33052" "../persons/33053" "../persons/43671"
## [13] "../persons/43672" "../persons/30222" "../persons/38845" "../persons/17296"
## [17] "../persons/38518" "../persons/38523" "../persons/38524" "../persons/38525"
## [21] "../persons/43337" "../persons/42619" "../persons/42620" "../persons/41311"
## [25] "../persons/49902" "../persons/20653" "../persons/47922" "../persons/33783"
## [29] "../persons/42051" "../persons/28921" "../persons/37887" "../persons/37890"
## [33] "../persons/37892" "../persons/38352" "../persons/42876" "../persons/42881"
## [37] "../persons/22859" "../persons/52962" "../persons/52963" "../persons/52965"
## [41] "../persons/17297" "../persons/55241" "../persons/416" "../persons/11593"
## [45] "../persons/41739" "../persons/41742" "../persons/41743" "../persons/52649"
## [49] "../persons/41738" "../persons/417"
Note that the above links, or hrefs, are relative URL paths. I still need the domain name for the web server http://www.vondel.humanities.uva.nl.
Above I created two vectors. One vector, names, is the text (html_text()) that I parsed from the <li> tags within the <body> of the HTML document. The other vector, url, holds the values of the href attribute of the anchor <a> tags.
Placing both vectors into a tibble makes manipulation easier when using tidyverse techniques.
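As a small offline sketch of that idea, the single list item quoted earlier can be parsed for both its text and its href, and the two values placed side by side in a tibble. The object names item, links, and one_row are my own.

```r
library(rvest)
library(tibble)

# The example list item from earlier, parsed from a string
item <- read_html('<li><a href="/ecartico/persons/17296">Anna Aaltse (1715 - 1738)</a></li>')

# Select the anchor once, then pull its text and its href attribute
links <- item %>% html_nodes("li a")

one_row <- tibble(
  names = html_text(links),
  url   = html_attr(links, "href")
)

print(one_row)
```

Selecting the nodes once and deriving both columns from the same node set keeps each name aligned with its own link.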
Goal
I want to develop a systematic workflow that builds a tibble consisting of each of the 50 names listed in the summary results object retrieved by read_html() performed on this page. Of course, mining and parsing the data is just the beginning. Data cleaning is a vital and constant aspect of web scraping.
Using vectors from parsing functions above….
results_df <- tibble(names, url)
results_df
## # A tibble: 50 x 2
## names url
## <chr> <chr>
## 1 Hillebrand Boudewynsz. van der Aa (1661 - 1717) ../persons/414
## 2 Boudewijn Pietersz van der Aa (? - ?) ../persons/10566
## 3 Pieter Boudewijnsz. van der Aa (1659 - 1733) ../persons/10567
## 4 Boudewyn van der Aa (1672 - ?) ../persons/10568
## 5 Machtelt van der Aa (? - ?) ../persons/27132
## 6 Claas van der Aa I (? - ?) ../persons/33780
## 7 Claas van der Aa II (? - ?) ../persons/33781
## 8 Willem van der Aa (? - ?) ../persons/33782
## 9 Hans von Aachen (1552 - 1615) ../persons/9203
## 10 Jacobus van Aaken (? - ?) ../persons/33052
## # ... with 40 more rows
From above we have links in a url vector, and target names in a names vector, for fifty names from the target website we want to crawl. Of course you also want to parse data for each person in the database. To do this we need to read (i.e. read_html()) the HTML for each relevant url in the results_df tibble. Below is an example of how to systematize the workflow. To do that, we’ll make a results tibble, results_df. But first, more data cleaning…
Create some new variables with mutate. Build a full URL from the relative URL path (i.e. the url vector) and the domain or base URL of the target site. Since we scraped the relative URL path, we have to construct a full URL.
urls_to_crawl_df <- results_df %>%
mutate(url = str_replace(url, "\\.\\.", "ecartico")) %>% # fixing a relative path reference by replacing '..' with 'ecartico'
mutate(full_url = glue::glue("http://www.vondel.humanities.uva.nl/{url}")) %>%
# mutate(full_url = str_replace_all(full_url, "\\.\\.", "")) %>%
select(full_url)
urls_to_crawl_df
## # A tibble: 50 x 1
## full_url
## <glue>
## 1 http://www.vondel.humanities.uva.nl/ecartico/persons/414
## 2 http://www.vondel.humanities.uva.nl/ecartico/persons/10566
## 3 http://www.vondel.humanities.uva.nl/ecartico/persons/10567
## 4 http://www.vondel.humanities.uva.nl/ecartico/persons/10568
## 5 http://www.vondel.humanities.uva.nl/ecartico/persons/27132
## 6 http://www.vondel.humanities.uva.nl/ecartico/persons/33780
## 7 http://www.vondel.humanities.uva.nl/ecartico/persons/33781
## 8 http://www.vondel.humanities.uva.nl/ecartico/persons/33782
## 9 http://www.vondel.humanities.uva.nl/ecartico/persons/9203
## 10 http://www.vondel.humanities.uva.nl/ecartico/persons/33052
## # ... with 40 more rows
As you can see, above, it’s really helpful to know about Tidyverse text manipulation, specifically mutate, glue, and pattern matching and regex using the stringr package.
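As an alternative to the string replacement above (not the approach used in this walkthrough), the xml2 package has url_absolute(), which resolves a relative path against the URL of the page it was scraped from. A minimal sketch, with relative_paths, base_url, and full_urls as illustrative names:

```r
library(xml2)

# Two of the relative paths scraped from the summary page
relative_paths <- c("../persons/414", "../persons/10566")

# The page the links were scraped from acts as the base URL
base_url <- "http://www.vondel.humanities.uva.nl/ecartico/persons/index.php?subtask=browse"

# Resolve each relative path against the base, as a browser would
full_urls <- url_absolute(relative_paths, base = base_url)

print(full_urls)
```

This avoids hard-coding the 'ecartico' directory, at the cost of needing to keep track of which page each link came from.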
To operationalize this part of the workflow, you want to iterate over the vector full_url found in the urls_to_crawl_df tibble, then read_html() each name that interests you. Remember that only 50 of the 54 rows in the results_df tibble are target names to crawl. So, really, you still have some data wrangling to do. How can you eliminate the four rows in the results_df tibble that are not targets (i.e. names)? Somewhere, below, I’ll also show you how to exclude the four rows of unnecessary/unhelpful information.
Use purrr::map instead of ‘for’ loops. Because purrr is the R/Tidyverse way. ‘For’ loops are fine, but invest some time learning purrr and you’ll be better off. Still, there’s no wrong way to iterate as long as you get the right answer. So, do what works. Below is the Tidyverse/Purrr way….
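To make the comparison concrete, here is a small sketch that does the same job both ways. It does not touch the network; it only pulls the trailing person id off three of the full URLs with base R's basename(). The variable names are my own.

```r
library(purrr)

full_url <- c(
  "http://www.vondel.humanities.uva.nl/ecartico/persons/414",
  "http://www.vondel.humanities.uva.nl/ecartico/persons/10566",
  "http://www.vondel.humanities.uva.nl/ecartico/persons/10567"
)

# The 'for' loop way: pre-allocate the result, then fill it by index
person_id_loop <- character(length(full_url))
for (i in seq_along(full_url)) {
  person_id_loop[i] <- basename(full_url[i])
}

# The purrr way: apply the function to each element, get a character vector back
person_id_map <- map_chr(full_url, basename)

print(person_id_map)
print(identical(person_id_loop, person_id_map))
```

Both versions give the same answer; map_chr() simply removes the bookkeeping of the index and the pre-allocated vector.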
Now I have a full list of navigation URLs, each of which represents a web page that has a summary of 50 names/links. My next task is to read the HTML of each URL representing a target-page. In this case a target is a detailed page with structured biographical information about an artist. By reading the URL (importing the HTML) for each target name, I will then have HTML for each individual target person. Of course, I still, then, have to read and parse the HTML of those target-name pages, but I can do that. The scraping (crawling + parsing) works when I have a URL per target person, because having a URL for each target-person’s page means I can systematically scrape the web site. In other words, I can crawl the summary navigation to construct a full URL for each name (i.e. page). Then I import (i.e. read_html()) each person’s page and parse the HTML for each person’s information.
But, back to the current task: import the HTML for each summary results page of 50 records…
You should read the notes below, but tl;dr: skip to the CODE below
Note: below, I introduce a pause (Sys.sleep()) in front of each read_html() function. This is a common technique for well behaved web scraping. Pausing before each read_html() call avoids overwhelming my target’s server/network infrastructure. If I overwhelm the target server, the people hosting it may mistake my crawler for a denial-of-service (DoS) attack. If they think I’m an attacker, they might choose to block my computer from crawling their site. If that happens, I’m up a creek. I don’t want that. I want my script to be a well behaved bot-crawler.
Speaking of being a good and honorable scraper-citizen, did I browse the robots.txt page for the site? Did I check the site for a Terms of Service page? Did I look to see if there were any written prohibitions against web crawling, systematic downloading, copyright, or licensing restrictions? I did and you should too. As of this writing, there do not appear to be any restrictions for this site. You should perform these types of good-scraping hygiene steps for every site you want to scrape!
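As a rough sketch of what reading a robots.txt file can look like in code, the example below scans Disallow rules with base R. The robots.txt content is invented for illustration and is not this site's file, the helper is_disallowed() is hypothetical, and the check is deliberately simplified: it ignores User-agent groups, Allow rules, and wildcards.

```r
# Hypothetical robots.txt content, invented for illustration only.
# A real file could be read with readLines() on the site's /robots.txt URL.
robots_lines <- c(
  "User-agent: *",
  "Disallow: /private/",
  "Disallow: /tmp/"
)

# Keep the path part of every Disallow rule, dropping empty rules
disallowed <- sub("^Disallow:[[:space:]]*", "",
                  grep("^Disallow:", robots_lines, value = TRUE))
disallowed <- disallowed[nzchar(disallowed)]

# Hypothetical helper: TRUE when the path starts with any disallowed prefix
is_disallowed <- function(path, rules) {
  any(startsWith(path, rules))
}

print(disallowed)
print(is_disallowed("/ecartico/persons/414", disallowed))
print(is_disallowed("/private/data", disallowed))
```

A check like this does not replace reading the file and the Terms of Service yourself; it only makes the Disallow rules easier to scan.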
Note: Below, for development purposes, I limit my crawling to 3 results pages of fifty links each: nav_df$url[1:3]. Be conservative during your code development to avoid appearing as a DoS attacker. Later, when you are ready to crawl your whole target site, you’ll want to remove such limits (i.e. [1:3]). But for now, do everyone a favor and try not to be overconfident. Stay in the kiddie pool. Do your development work until you are sure you’re not accidentally unleashing a malicious or poorly constructed web crawler.
Note: Below, I am keeping the original target URL variable, summary_url, for later reference. This way I will have a record of which parsed data results came from which URL web page.
Note: Below, the final result is a tibble with a vector, summary_url, and an associated column of HTML results, where each result is stored as a nested R list. That is, a column of data types that are all “lists”, aka a “list column”. Personally I find lists to be a pain. I prefer working with tibbles (aka data frames). But lists appear often in R data wrangling, especially when scraping with rvest. The more you work with lists, the more you come to tolerate lists for the flexible data type that they are. Anyway, if I were to look at only the first row of results from the html_results column, nav_results_list$html_results[1], I would find a list of the raw HTML from the first summary results page imported via read_html().
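A list column is easier to get used to on a toy example. The sketch below builds one from two invented HTML strings rather than real pages; pages, demo, first_doc, and first_text are illustrative names, not objects from this walkthrough.

```r
library(rvest)
library(purrr)
library(tibble)

# Two tiny stand-in pages, held as strings so no server is contacted
pages <- c(
  "<html><body><p>page one</p></body></html>",
  "<html><body><p>page two</p></body></html>"
)

# map() returns a list, so html_results becomes a list column
demo <- tibble(
  page_id      = c("page_1", "page_2"),
  html_results = map(pages, read_html)
)

# [1] keeps a list of length one; [[1]] returns the parsed document itself
first_doc <- demo$html_results[[1]]

first_text <- first_doc %>%
  html_nodes("p") %>%
  html_text()

print(demo)
print(first_text)
```

The difference between single and double brackets is worth remembering: the parsing functions want the document returned by [[1]], or a map() over the whole column.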
tl;dr This is testing. I have three URLs (html_results[1:3]), one for each of the first three navigation summary pages. Each summary page will contain the raw HTML for 50 names. I will read_html each link, waiting 2 seconds between each read_html.
nav_results_list <- tibble(
  html_results = map(nav_df$url[1:3],
                     ~ {
                       # url[1:3] limits the crawl to the first three summary results pages (each page = 50 results)
                       # DO THIS! Pause 2 seconds between server requests so the target web server does not mistake the crawling bot for a DoS attack and block it.
                       Sys.sleep(2)
                       .x %>%
                         read_html()
                     }),
  summary_url = nav_df$url[1:3]
)
nav_results_list
## # A tibble: 3 x 2
## html_results summary_url
## <list> <glue>
## 1 <xml_dcmn> https://www.vondel.humanities.uva.nl/ecartico/persons/index.php?~
## 2 <xml_dcmn> https://www.vondel.humanities.uva.nl/ecartico/persons/index.php?~
## 3 <xml_dcmn> https://www.vondel.humanities.uva.nl/ecartico/persons/index.php?~
Above, I have three rows of lists, each list is the read_html() results of a summary results page, i.e. each list has 50 URLs and text of my eventual targets.
nav_results_list$summary_url is the URL for each summary page.
nav_results_list$html_results is the read_html() results that I want to parse for the href attributes and the html_text.
Right. Using purrr (map()), I can iterate over the html_results lists, parsing each list with the html_attr() and html_text() functions. It is convenient to keep this parsed data in a tibble as a list: one column for the URL targets; one column for the HTML text (which will contain the name of the person to whom the target URL corresponds). The results are nested lists within a tibble.
results_by_page <- tibble(summary_url = nav_results_list$summary_url,
url =
map(nav_results_list$html_results,
~ .x %>%
html_nodes("#setwidth li a") %>%
html_attr("href")),
name =
map(nav_results_list$html_results,
~ .x %>%
html_nodes("#setwidth li a") %>%
html_text()
)
)
results_by_page
## # A tibble: 3 x 3
## summary_url url name
## <glue> <list> <list>
## 1 https://www.vondel.humanities.uva.nl/ecartico/persons/inde~ <chr [50~ <chr [5~
## 2 https://www.vondel.humanities.uva.nl/ecartico/persons/inde~ <chr [50~ <chr [5~
## 3 https://www.vondel.humanities.uva.nl/ecartico/persons/inde~ <chr [50~ <chr [5~
When I unnest the nested list, I then have a single tibble with 150 URLs and 150 names, one row for each target name. (I also used filter() to do some more regex data cleanup, which I alluded to near the beginning of this document.)
results_by_page %>%
unnest(cols = c(url, name)) %>%
mutate(url = str_replace(url, "\\.\\.", "ecartico")) %>% # fixing a relative path reference by replacing '..' with 'ecartico'
mutate(full_url = glue::glue("http://www.vondel.humanities.uva.nl/{url}"))
## # A tibble: 150 x 4
## summary_url url name full_url
## <glue> <chr> <chr> <glue>
## 1 https://www.vondel.humaniti~ ecartico/~ Hillebrand Boud~ http://www.vondel.h~
## 2 https://www.vondel.humaniti~ ecartico/~ Boudewijn Piete~ http://www.vondel.h~
## 3 https://www.vondel.humaniti~ ecartico/~ Pieter Boudewij~ http://www.vondel.h~
## 4 https://www.vondel.humaniti~ ecartico/~ Boudewyn van d~ http://www.vondel.h~
## 5 https://www.vondel.humaniti~ ecartico/~ Machtelt van d~ http://www.vondel.h~
## 6 https://www.vondel.humaniti~ ecartico/~ Claas van der A~ http://www.vondel.h~
## 7 https://www.vondel.humaniti~ ecartico/~ Claas van der A~ http://www.vondel.h~
## 8 https://www.vondel.humaniti~ ecartico/~ Willem van der ~ http://www.vondel.h~
## 9 https://www.vondel.humaniti~ ecartico/~ Hans von Aache~ http://www.vondel.h~
## 10 https://www.vondel.humaniti~ ecartico/~ Jacobus van Aak~ http://www.vondel.h~
## # ... with 140 more rows
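What unnest() does to the list columns is easier to see on a toy tibble. In this sketch the page labels page_1 and page_2 are stand-ins for the summary URLs, and only three of the names and relative paths from earlier are used.

```r
library(tibble)
library(tidyr)

# Two summary pages: the first holds two people, the second holds one
nested <- tibble(
  summary_url = c("page_1", "page_2"),
  url = list(
    c("../persons/414", "../persons/10566"),
    c("../persons/9203")
  ),
  name = list(
    c("Hillebrand Boudewynsz. van der Aa (1661 - 1717)",
      "Boudewijn Pietersz van der Aa (? - ?)"),
    c("Hans von Aachen (1552 - 1615)")
  )
)

# Unnesting both list columns together gives one row per person,
# repeating summary_url so each row still records its source page
flat <- unnest(nested, cols = c(url, name))

print(flat)
```

Unnesting url and name in the same call matters: the two columns are expanded in parallel, so each name stays paired with its own link.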
Now, my results_by_page tibble consists of three column variables:
summary_url: the link to the Summary Results page which contains the name of each target-person
url: the relative URL for each target-person
name: the name of each target-person
Now I can iterate over each row of my results_by_page$url vector to read_html for each target. Then I can parse the raw HTML for each target name page. When I follow the links for each name, I have the raw HTML of each person, in lists, ready to be parsed with the html_nodes, html_text, and html_attr functions.
Now you know how to crawl a website to get a URL for each name found at the source web site (i.e. crawl the site’s navigation). The next goal is to read_html() to ingest and parse the HTML for each target.
Web scraping = crawling + parsing
Below is an example of gathering and parsing information for one URL representing one person.
Ingest, i.e. read_html(), each target name, then parse the results of each to mine each target for specific information. In this case, I want the names of each person’s children.
The information gathered is information from the detailed names page about the children of one person in the target database.
Emanuel Adriaenssen has three children:
Children
# http://www.vondel.humanities.uva.nl/ecartico/persons/10579
# schema:children
emanuel <- read_html("http://www.vondel.humanities.uva.nl/ecartico/persons/10579")
children_name <- emanuel %>%
html_nodes("ul~ h2+ ul li > a") %>%
html_text()
children_name
## [1] "Alexander Adriaenssen (1587 - 1661)"
## [2] "Vincent Adriaenssen I (1595 - 1675)"
## [3] "Niclaes Adriaenssen (1598 - ca. 1649)"
There now. I just scraped and parsed data for one target, one person in my list of target URLs. Now use purrr to iterate over each target URL in the list. Do not forget to pause, Sys.sleep(2), between each iteration of the read_html() function.
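That last step might look something like the sketch below. It is an outline under stated assumptions, not the code behind this walkthrough: get_children() and scrape_children() are hypothetical helpers, and the two "pages" are invented HTML strings shaped so that the children selector used above matches them. For a real crawl, targets would be the full URLs and delay would stay at 2.

```r
library(rvest)
library(purrr)
library(tibble)

# Hypothetical helper: parse the children's names out of one person page
get_children <- function(page) {
  page %>%
    html_nodes("ul~ h2+ ul li > a") %>%
    html_text()
}

# Hypothetical helper: pause, read, and parse each target in turn
scrape_children <- function(targets, delay = 2) {
  tibble(
    target = targets,
    children = map(targets, ~ {
      Sys.sleep(delay)  # be polite: wait between requests
      get_children(read_html(.x))
    })
  )
}

# Invented stand-ins for two person pages (literal HTML instead of URLs)
person_a <- '<div><ul><li>menu</li></ul><h2>Children</h2><ul><li><a href="../persons/1">Child One</a></li><li><a href="../persons/2">Child Two</a></li></ul></div>'
person_b <- '<div><ul><li>menu</li></ul><h2>Children</h2><ul><li><a href="../persons/3">Child Three</a></li></ul></div>'

# delay = 0 only because nothing is being requested from a server here
demo <- scrape_children(c(person_a, person_b), delay = 0)

print(lengths(demo$children))
print(demo$children[[1]])
```

The children column is a list column, so it can be unnested in the same way as the summary results to get one row per child.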
John Little
Data Science Librarian
Center for Data & Visualization Sciences
Duke University Libraries
https://JohnLittle.info
https://Rfun.library.duke.edu
https://library.duke.edu/data